Video captioning is the task of generating a natural language sentence that describes a video. A video description contains not only words that name the objects in the video but also words that express the relationships between those objects, as well as grammatically necessary words. To reflect this characteristic explicitly in a deep learning model, we propose a multi-representation switching method. The proposed method consists of three components: entity extraction, motion extraction, and textual feature extraction...